Translate, Predict or Generate: Modeling Rich Morphology in Statistical Machine Translation

Authors

Ahmed El Kholy

Nizar Habash

Abstract

We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process, which maps source tokens to target tokens. Alternatively, these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a step separate from translation and generation. We focus on three morphological features that we demonstrate, through a manual error analysis, to be the most problematic for English-Arabic SMT: gender, number and the determiner clitic. Our results show significant improvements over a state-of-the-art baseline (phrase-based SMT) of almost 1% absolute BLEU on a medium-sized training set. Our best configuration models the determiner as part of core translation, predicts gender and number separately, and handles the remaining features through generation.

Similar resources

Modeling Inflection and Word-Formation in SMT

The current state-of-the-art in statistical machine translation (SMT) suffers from issues of sparsity and inadequate modeling power when translating into morphologically rich languages. We model both inflection and word-formation for the task of translating into German. We translate from English words to an underspecified German representation and then use linear-chain CRFs to predict the fully ...

Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing

Machine Translation is one of the oldest and most active research areas in Natural Language Processing. Currently, Statistical Machine Translation (SMT) dominates Machine Translation research. Statistical Machine Translation is an approach to Machine Translation that uses models to learn translation patterns directly from data and generalizes them to translate new, unseen text. T...

How to overtake Google in MT quality - the Baltic case

The motivation of the language technology company Tilde is to improve the quality of machine translation for lesser-resourced languages such as the languages of the Baltic countries. Generic MT solutions like Google Translate perform poorly for these complex languages. To compensate for the shortage of training data and to deal with rich morphology, we are applying different approaches in combining statistical ...

The TÜBİTAK-UEKAE statistical machine translation system for IWSLT 2009

We describe our Arabic-to-English and Turkish-to-English machine translation systems that participated in the IWSLT 2009 evaluation campaign. Both systems are based on the Moses statistical machine translation toolkit, with added components to address the rich morphology of the source languages. Three different morphological approaches are investigated for Turkish. Our primary submission uses l...

Joint Morphological-Lexical Language Modeling for Machine Translation

We present a joint morphological-lexical language model (JMLLM) for use in statistical machine translation (SMT) of language pairs where one or both of the languages are morphologically rich. The proposed JMLLM takes advantage of the rich morphology to reduce the Out-Of-Vocabulary (OOV) rate, while keeping the predictive power of the whole words. It also allows incorporation of additional avail...

Publication date: 2012
